CREDIT RISK ANALYSIS ¶

INTRODUCTION¶

Credit risk analysis plays a crucial role in the financial industry, enabling lenders to assess the creditworthiness of potential borrowers and make informed decisions about lending. With the increasing availability of data and advancements in machine learning techniques, credit risk analysis has seen significant improvements in accuracy and efficiency.

In this Jupyter Notebook, we will explore the process of credit risk analysis using real-world credit data. Our goal is to build a predictive model that classifies borrowers into risky and non-risky categories, helping financial institutions minimize losses and maximize profitability.

DATA OVERVIEW¶

The dataset used in this analysis contains information about various borrowers, including their age, income, loan intent, loan amount, and previous credit history. It also includes the loan grade, which indicates the level of risk associated with each loan application (ranging from "A" for low risk to "G" for high risk), among other features.

Description of the data

feature description
person_age The person's age in years.
person_income The person's annual income.
person_home_ownership The type of home ownership (RENT, OWN, MORTGAGE, OTHER).
person_emp_length The person's employment length in years.
loan_intent The person's intent for the loan (PERSONAL, EDUCATION, MEDICAL, VENTURE, HOMEIMPROVEMENT, DEBTCONSOLIDATION).
loan_grade The risk grade of the loan (A, B, C, D, E, F, G), where A is the least risky and G the most risky.
loan_amnt The loan amount.
loan_int_rate The loan interest rate (between roughly 5% and 23%).
loan_status Whether the loan is currently in default, with 1 being default and 0 being non-default.
loan_percent_income The loan amount as a percentage of the person's income.
cb_person_default_on_file Whether the person has a default history (Y, N).
cb_person_cred_hist_length The length of the person's credit history.

Project Steps¶

1. Exploratory Data Analysis (EDA): Through EDA, we will gain insights into the distribution of various features, explore correlations, and identify potential patterns or trends.

2. Data Preprocessing: We will clean and preprocess the data to handle missing values, encode categorical variables, and prepare the data for modeling.

3. Feature Selection: To build an effective credit risk model, we will select relevant features and examine their impact on the target variable.

4. Model Building: Using machine learning algorithms such as XGBoost, Random Forest, and Logistic Regression, we will train predictive models to classify borrowers as low-risk or high-risk.

5. Hyperparameter Tuning: Fine-tuning each model's hyperparameters will help optimize performance and produce more accurate predictions.

6. Model Evaluation: We will evaluate the performance of each model using appropriate metrics, such as accuracy, precision, recall, and F1 score.

7. Credit Risk Prediction: Using the selected model, we will predict the credit risk of new loan applicants and classify them into appropriate risk categories.

8. Conclusion: Finally, we will summarize our findings, discuss the model's effectiveness, and provide recommendations for future improvements.
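The steps above can be condensed into a minimal, hypothetical sketch on synthetic data (make_classification stands in for the real credit dataset; the pipeline and parameter grid here are illustrative only, not the ones used later in this notebook):

```python
## Hypothetical end-to-end sketch: split -> preprocess -> tune -> evaluate
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Imbalanced synthetic stand-in for the credit data (80/20 class split)
X, y = make_classification(n_samples=500, n_features=8, weights=[0.8, 0.2],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Preprocessing and model bundled in one pipeline, tuned by grid search
pipe = Pipeline([('scaler', StandardScaler()),
                 ('clf', LogisticRegression(max_iter=1000))])
grid = GridSearchCV(pipe, {'clf__C': [0.1, 1, 10]}, cv=3, scoring='f1')
grid.fit(X_train, y_train)

f1 = f1_score(y_test, grid.predict(X_test))
print(grid.best_params_, round(f1, 2))
```

Stratifying the split keeps the minority-class share identical in train and test, which matters for an imbalanced target like loan_status.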

Table of Contents ¶

  • importing and understanding the data
  • cleaning the data
  • Exploring and visualizing the data:
    • analysing the categorical features
    • analysing the numerical features
    • analysing the target feature
  • preprocessing the data
    • checking / dealing with missing data
    • removing outliers based on observations & domain knowledge
    • creating the main pipeline
    • oversampling & dealing with data imbalance
  • Training Models
    • hyperparameter tuning / performing Grid Search with cross-validation on each model
  • Evaluation of the Results
    • scores of the models with different metrics ('Accuracy', 'F1 Score', 'MSRE',...)
    • Learning curve of the most performant models
    • confusion matrix
  • pickling the best model
  • webapp implementation with Streamlit
  • Conclusion
In [1]:
## Basic Libraries:
import pandas as pd
pd.options.display.max_colwidth=150   ## this is used to set the column width.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns 
import plotly.express as px
import warnings
import joblib
warnings.filterwarnings("ignore")
%matplotlib inline 

## For making sample data:
from sklearn.datasets import make_classification

## For Preprocessing: 
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split, RandomizedSearchCV, cross_val_score, RepeatedKFold,RepeatedStratifiedKFold
from sklearn.preprocessing import OneHotEncoder, StandardScaler, LabelEncoder
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.metrics import f1_score
from sklearn.metrics import mean_squared_error 

# from sklearn.base import TransformerMixin,BaseEstimator

## Using imblearn library:
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

## Using msno Library for Missing Value analysis:
import missingno as msno

## For Metrics:
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix, classification_report
## (the deprecated plot_precision_recall_curve / plot_confusion_matrix helpers were removed in scikit-learn 1.2 and are unused here)
from sklearn.model_selection import learning_curve

## For Machine Learning Models:
from sklearn.linear_model import LogisticRegression,LinearRegression
from sklearn.neighbors import KNeighborsClassifier,KNeighborsRegressor
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier,RandomForestRegressor
from xgboost import XGBClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV
import pickle

## Setting the seed to allow reproducibility
np.random.seed(31415)

1. Importing & Understanding the Dataset ¶

In [2]:
df = pd.read_csv("./credit_risk_dataset.csv")
df.head(10)
Out[2]:
person_age person_income person_home_ownership person_emp_length loan_intent loan_grade loan_amnt loan_int_rate loan_status loan_percent_income cb_person_default_on_file cb_person_cred_hist_length
0 22 59000 RENT 123.0 PERSONAL D 35000 16.02 1 0.59 Y 3
1 21 9600 OWN 5.0 EDUCATION B 1000 11.14 0 0.10 N 2
2 25 9600 MORTGAGE 1.0 MEDICAL C 5500 12.87 1 0.57 N 3
3 23 65500 RENT 4.0 MEDICAL C 35000 15.23 1 0.53 N 2
4 24 54400 RENT 8.0 MEDICAL C 35000 14.27 1 0.55 Y 4
5 21 9900 OWN 2.0 VENTURE A 2500 7.14 1 0.25 N 2
6 26 77100 RENT 8.0 EDUCATION B 35000 12.42 1 0.45 N 3
7 24 78956 RENT 5.0 MEDICAL B 35000 11.11 1 0.44 N 4
8 24 83000 RENT 8.0 PERSONAL A 35000 8.90 1 0.42 N 2
9 21 10000 OWN 6.0 VENTURE D 1600 14.74 1 0.16 N 3

Basic information:¶

In [3]:
df.shape[0],df.shape[1]
Out[3]:
(32581, 12)
In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32581 entries, 0 to 32580
Data columns (total 12 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   person_age                  32581 non-null  int64  
 1   person_income               32581 non-null  int64  
 2   person_home_ownership       32581 non-null  object 
 3   person_emp_length           31686 non-null  float64
 4   loan_intent                 32581 non-null  object 
 5   loan_grade                  32581 non-null  object 
 6   loan_amnt                   32581 non-null  int64  
 7   loan_int_rate               29465 non-null  float64
 8   loan_status                 32581 non-null  int64  
 9   loan_percent_income         32581 non-null  float64
 10  cb_person_default_on_file   32581 non-null  object 
 11  cb_person_cred_hist_length  32581 non-null  int64  
dtypes: float64(3), int64(5), object(4)
memory usage: 3.0+ MB
In [5]:
df.describe()
Out[5]:
person_age person_income person_emp_length loan_amnt loan_int_rate loan_status loan_percent_income cb_person_cred_hist_length
count 32581.000000 3.258100e+04 31686.000000 32581.000000 29465.000000 32581.000000 32581.000000 32581.000000
mean 27.734600 6.607485e+04 4.789686 9589.371106 11.011695 0.218164 0.170203 5.804211
std 6.348078 6.198312e+04 4.142630 6322.086646 3.240459 0.413006 0.106782 4.055001
min 20.000000 4.000000e+03 0.000000 500.000000 5.420000 0.000000 0.000000 2.000000
25% 23.000000 3.850000e+04 2.000000 5000.000000 7.900000 0.000000 0.090000 3.000000
50% 26.000000 5.500000e+04 4.000000 8000.000000 10.990000 0.000000 0.150000 4.000000
75% 30.000000 7.920000e+04 7.000000 12200.000000 13.470000 0.000000 0.230000 8.000000
max 144.000000 6.000000e+06 123.000000 35000.000000 23.220000 1.000000 0.830000 30.000000

2. Cleaning the data ¶

data cleaning plan:¶

1- checking / removing duplicates

2- feature selection

3- removing outliers based on data knowledge

4- Checking for Missing Data

  • checking / removing duplicates
In [6]:
## Checking for Duplicates
dups = df.duplicated()
dups.value_counts() #There are 165 Duplicated rows
Out[6]:
False    32416
True       165
dtype: int64
In [7]:
## Removing the Duplicates
df.drop_duplicates(inplace=True)
  • loan_int_rate describes the interest rate offered on loans by banks or other financial institutions. There is no fixed value, as it varies from bank to bank; hence we drop this column from the analysis.
In [8]:
df.drop(['loan_int_rate'],axis=1,inplace=True)
In [9]:
ccol=df.select_dtypes(include=["object"]).columns
ncol=df.select_dtypes(include=["int","float"]).columns

print("The number of Categorical columns are:",len(ccol))
print("The number of Numerical columns are:",len(ncol))
The number of Categorical columns are: 4
The number of Numerical columns are: 7
  • Printing the different columns with their cardinality (number of unique elements in each column):
In [10]:
print("The NUMERICAL columns are:\n")
for i in ncol:
    print("->",i,"-",df[i].nunique())
    
print("\n---------------------------\n")
print("The CATEGORICAL columns are:\n")
for i in ccol:
    print("->",i,"-",df[i].nunique())
The NUMERICAL columns are:

-> person_age - 58
-> person_income - 4295
-> person_emp_length - 36
-> loan_amnt - 753
-> loan_status - 2
-> loan_percent_income - 77
-> cb_person_cred_hist_length - 29

---------------------------

The CATEGORICAL columns are:

-> person_home_ownership - 4
-> loan_intent - 6
-> loan_grade - 7
-> cb_person_default_on_file - 2
  • Checking ranges of numerical variables
In [11]:
for col in ncol:
    min_value = df[col].min()
    max_value = df[col].max()
    print(f'Range for {col} : [{min_value} to {max_value}]')
Range for person_age : [20 to 144]
Range for person_income : [4000 to 6000000]
Range for person_emp_length : [0.0 to 123.0]
Range for loan_amnt : [500 to 35000]
Range for loan_status : [0 to 1]
Range for loan_percent_income : [0.0 to 0.83]
Range for cb_person_cred_hist_length : [2 to 30]

3. Exploring and visualizing the data ¶

3.1 Analysing Categorical features ¶

In [12]:
plt.figure(figsize=(10,7))
for index, col in enumerate(ccol):
    plt.subplot(2,3, index+1)
    sns.countplot(x=col, hue='loan_status', data=df, palette='Blues')
    plt.xticks(rotation=90)
plt.tight_layout()
In [13]:
# Individual frequency plot
plt.figure(figsize=(10,7))
for index, col in enumerate(ccol):
    plt.subplot(2,3, index+1)
    sns.countplot(x=col, palette='Blues', data= df)
    plt.xticks(rotation=90)
plt.tight_layout()
In [14]:
loan_intent_counts = df['loan_intent'].value_counts()

# Create the pie chart using Plotly
fig = px.pie(loan_intent_counts, names=loan_intent_counts.index, values=loan_intent_counts.values,
             title='Pie Chart of Loan Intent', color_discrete_sequence=px.colors.sequential.Viridis)

# Show the plot
fig.show()
In [15]:
mean_income_by_ownership = df.groupby('person_home_ownership')['person_income'].mean().reset_index()

# Create the bar plot using Plotly
fig = px.bar(mean_income_by_ownership, x='person_home_ownership', y='person_income',
             title='Mean Person Income by Home Ownership', color='person_home_ownership',
             color_discrete_sequence=px.colors.sequential.Viridis)

# Show the plot
fig.show()

3.2 Analysing Continuous features ¶

In [16]:
plt.figure(figsize=(8, 6))
df['person_age'].plot.hist(bins=10, color='skyblue', edgecolor='black')

plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Distribution of Person Age')

plt.show()
In [17]:
plt.figure(figsize=(8, 6))
df.boxplot(column='person_income', vert=False)

plt.xlabel('Income')
plt.title('Boxplot of Person Income')

plt.show()

All person_income values above 1.5M are aberrant values (outliers).

In [18]:
# Generate the correlation matrix
correlation_matrix = df.corr()

# Create the heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)

# Add title
plt.title('Correlation Heatmap')

# Show the plot
plt.show()

person_age -> cb_person_cred_hist_length: A strong positive correlation between a person's age and the length of their credit history indicates that older people tend to have longer credit histories. This is generally expected, since older people have had more time to build a credit history.

loan_amnt -> loan_percent_income: The strong positive correlation between the loan amount and the percentage of income allocated to the loan suggests that granted loan amounts generally increase as the share of income devoted to repayment increases.

loan_amnt -> person_income: The strong positive correlation between the loan amount and the person's income indicates that people with higher incomes tend to obtain larger loans. This is generally expected, since a higher income is usually associated with greater repayment capacity.

loan_status -> loan_percent_income: The strong positive correlation between the loan status and the percentage of income allocated to the loan suggests that loans taking up a larger share of income are more likely to default.

loan_status -> loan_int_rate: The strong positive correlation between the loan status and the interest rate indicates that loans with higher interest rates are more likely to default.

person_income -> loan_percent_income: The strong negative correlation between the person's income and the percentage of income allocated to the loan indicates that people with higher incomes generally devote a smaller share of their income to loan repayment.

In [19]:
# Create the scatter plot using Plotly
fig = px.scatter(df, x='person_age', y='person_income', title='Scatter Plot of Age vs. Income', color='person_income',
                 color_continuous_scale=px.colors.sequential.Viridis)

# Show the plot
fig.show()
In [20]:
# Calculate the sum of 'person_income' for each category of 'person_home_ownership'
income_by_ownership = df.groupby('person_home_ownership')['person_income'].sum().reset_index()

# Get the list of categories and the total income for each category
categories = income_by_ownership['person_home_ownership']
total_income = income_by_ownership['person_income']

# Create the stacked bar plot using Pyplot
plt.figure(figsize=(10, 6))
plt.bar(categories, total_income, color='skyblue')

# Add labels and title
plt.xlabel('Home Ownership')
plt.ylabel('Total Income')
plt.title('Total Income by Home Ownership')

# Show the plot
plt.show()
In [21]:
# Scatter plot: person_age vs. cb_person_cred_hist_length
plt.figure(figsize=(8, 6))
plt.scatter(df['person_age'], df['cb_person_cred_hist_length'], marker='o', color='skyblue')
plt.xlabel('Person Age')
plt.ylabel('Credit History Length')
plt.title('Scatter Plot: Person Age vs. Credit History Length')
plt.show()

# Scatter plot: loan_amount vs. loan_percent_income
plt.figure(figsize=(8, 6))
plt.scatter(df['loan_amnt'], df['loan_percent_income'], marker='o', color='green')
plt.xlabel('Loan Amount')
plt.ylabel('Loan Percent Income')
plt.title('Scatter Plot: Loan Amount vs. Loan Percent Income')
plt.show()

# Scatter plot: loan_amount vs. person_income
plt.figure(figsize=(8, 6))
plt.scatter(df['loan_amnt'], df['person_income'], marker='o', color='orange')
plt.xlabel('Loan Amount')
plt.ylabel('Person Income')
plt.title('Scatter Plot: Loan Amount vs. Person Income')
plt.show()
In [22]:
numerical_columns = ncol

# Create histograms for each numerical column
plt.figure(figsize=(12, 8))

for i, col in enumerate(numerical_columns[:6], 1):  # Limit to 6 columns to fit in the grid
    plt.subplot(2, 3, i)
    plt.hist(df[col], bins=20, edgecolor='black')
    plt.xlabel(col)
    plt.ylabel('Frequency')
plt.tight_layout()
plt.show()

3.3 Analysing target feature ¶

In [23]:
# Box plot: loan_status vs. loan_percent_income
plt.figure(figsize=(8, 6))
plt.boxplot([df[df['loan_status'] == 0]['loan_percent_income'],
             df[df['loan_status'] == 1]['loan_percent_income']],
            labels=['Paid', 'Default'], showfliers=False, notch=True, patch_artist=True)
plt.xlabel('Loan Status')
plt.ylabel('Loan Percent Income')
plt.title('Box Plot: Loan Status vs. Loan Percent Income')
plt.show()
In [24]:
sns.countplot(x=df['loan_status'], palette='Oranges')
plt.title('Distribution of Risk')
plt.show()
In [25]:
df['loan_status'].value_counts().plot(kind='pie', autopct='%1.2f%%', explode=[0,0.1], shadow=True)
Out[25]:
<AxesSubplot:ylabel='loan_status'>

The data is highly IMBALANCED. We will deal with this using oversampling techniques such as SMOTE (which synthesizes new minority-class samples from their k-nearest neighbors).

4. Preprocessing the data ¶

4.1 Checking / dealing with missing data: ¶

Missing data, or missing values, occur when you don’t have data stored for certain variables or participants. Data can go missing due to incomplete data entry, equipment malfunctions, lost files, and many other reasons.

There are typically 3 types of missing values:

  1. Missing completely at random (MCAR)

  2. Missing at random (MAR)

  3. Missing not at random (MNAR)

Problems: Missing data are problematic because, depending on the type, they can sometimes cause sampling bias. This means your results may not be generalizable outside of your study because your data come from an unrepresentative sample.
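A small synthetic illustration of the first two types (the column names only mimic this dataset; MNAR is not shown, since it depends on the unobserved values themselves and cannot be told apart from the data alone):

```python
## MCAR vs. MAR on synthetic columns resembling the credit data
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df_demo = pd.DataFrame({'income': rng.normal(50_000, 10_000, 1_000),
                        'emp_length': rng.integers(0, 30, 1_000).astype(float)})

# MCAR: 10% of emp_length dropped uniformly at random
mcar = df_demo.copy()
mcar.loc[rng.random(len(mcar)) < 0.10, 'emp_length'] = np.nan

# MAR: emp_length missing far more often for low-income rows
mar = df_demo.copy()
p = np.where(mar['income'] < 45_000, 0.30, 0.02)
mar.loc[rng.random(len(mar)) < p, 'emp_length'] = np.nan

print(mcar['emp_length'].isna().mean(), mar['emp_length'].isna().mean())
```

Under MAR, the missing rate differs sharply between income groups, which is exactly the kind of pattern tools like missingno can help surface.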

In [26]:
df.isnull().any()
Out[26]:
person_age                    False
person_income                 False
person_home_ownership         False
person_emp_length              True
loan_intent                   False
loan_grade                    False
loan_amnt                     False
loan_status                   False
loan_percent_income           False
cb_person_default_on_file     False
cb_person_cred_hist_length    False
dtype: bool
In [27]:
df.isna().sum()
Out[27]:
person_age                      0
person_income                   0
person_home_ownership           0
person_emp_length             887
loan_intent                     0
loan_grade                      0
loan_amnt                       0
loan_status                     0
loan_percent_income             0
cb_person_default_on_file       0
cb_person_cred_hist_length      0
dtype: int64
In [28]:
msno.bar(df)
Out[28]:
<AxesSubplot:>

NOTE: EVERY PREPROCESSING TECHNIQUE IS DONE ONLY ON THE TRAIN SET. SO SPLITTING IS MANDATORY BEFORE OUTLIER REMOVAL, MISSING VALUES HANDLING, OVERSAMPLING, ETC...
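The rule above in miniature, assuming a plain StandardScaler as the preprocessing step: statistics are learned from the train split only and then reused, unchanged, on the test split.

```python
## fit_transform on train, transform (only) on test
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.arange(20, dtype=float).reshape(-1, 1)
y = np.arange(20) % 2
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

scaler = StandardScaler()
X_tr_s = scaler.fit_transform(X_tr)   # statistics learned from train only
X_te_s = scaler.transform(X_te)       # the same train statistics applied to test
```

Calling fit_transform on the test set instead would leak test-set statistics into the preprocessing and make evaluation optimistic.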

In [29]:
# we split the data to train / test parts
X_train, X_test, y_train, y_test = train_test_split(df.drop('loan_status', axis=1), df['loan_status'],
                                        random_state=0,  test_size=0.2, stratify=df['loan_status'],
                                        shuffle=True)
In [ ]:

In [30]:
#To print the number of unique values:
for col in X_train:
    print(col, '--->', X_train[col].nunique())
    if X_train[col].nunique()<20:
        print(X_train[col].value_counts(normalize=True)*100)
    print()
person_age ---> 58

person_income ---> 3680

person_home_ownership ---> 4
RENT        50.320068
MORTGAGE    41.439149
OWN          7.916859
OTHER        0.323924
Name: person_home_ownership, dtype: float64

person_emp_length ---> 36

loan_intent ---> 6
EDUCATION            19.809502
MEDICAL              18.787598
VENTURE              17.542033
PERSONAL             16.878760
DEBTCONSOLIDATION    15.968687
HOMEIMPROVEMENT      11.013420
Name: loan_intent, dtype: float64

loan_grade ---> 7
A    32.932284
B    32.126330
C    19.902052
D    11.121394
E     3.004010
F     0.732685
G     0.181243
Name: loan_grade, dtype: float64

loan_amnt ---> 710

loan_percent_income ---> 75

cb_person_default_on_file ---> 2
N    82.392411
Y    17.607589
Name: cb_person_default_on_file, dtype: float64

cb_person_cred_hist_length ---> 29

4.2 Removing Outliers based on Observations & Domain Knowledge: ¶

  • We can exclude clients older than 80 years old
In [31]:
X_train.loc[X_train['person_age']>=80, :]  
X_train = X_train.loc[X_train['person_age']<=80, :]
  • We can also exclude rows whose work experience is >60 years (assuming an upper bound on employment length).
In [32]:
X_train.loc[X_train['person_emp_length']>=60, :]
X_train = X_train.loc[X_train['person_emp_length']<60, :]
  • we will also exclude people making more than 2M / year
In [33]:
X_train.loc[X_train['person_income']>=2000000, :]
X_train = X_train.loc[X_train['person_income']<=2000000, :]
In [34]:
y_train = y_train[X_train.index]
y_train.shape
Out[34]:
(25196,)

4.3 Creating the main pipeline ¶

The Main Pipeline will be made of two parts:¶
  • Preprocessing for NUMERICAL VARIABLES:

1- Iterative imputer - To handle missing values

2- Scaling - To maintain the scale among features

  • Preprocessing for CATEGORICAL VARIABLES:

1- One Hot Encoder - To encode each categorical variable as binary indicator columns

  • Finally, we apply SMOTE - To handle the class imbalance in the dataset
In [35]:
#Create the main pipeline for preprocessing numerical variables:
numerical_pipeline = Pipeline([
    ('imputer', IterativeImputer()),  # Impute missing values using iterative imputer
    ('scaler', StandardScaler())     # Scale numerical features
])
In [36]:
#Create the pipeline for preprocessing categorical variables:
categorical_pipeline = Pipeline([
    ('encoder', OneHotEncoder())  # One-hot encode categorical features
])
In [37]:
# Derive the numerical and categorical feature names from the training data
numerical_features = X_train.select_dtypes(include='number').columns.tolist()
categorical_features = X_train.select_dtypes(include='object').columns.tolist()

preprocessor = ColumnTransformer([
    ('numerical', numerical_pipeline, numerical_features),
    ('categorical', categorical_pipeline, categorical_features)
])
In [38]:
#Fit and transform the main pipeline on the training data:
X_train_preprocessed = preprocessor.fit_transform(X_train)
In [39]:
def fit_preprocessing_pipeline(X_train):
    return preprocessor.fit(X_train)
In [40]:
joblib.dump(preprocessor, 'preprocessing_pipeline.pkl')
Out[40]:
['preprocessing_pipeline.pkl']
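As a hedged sketch of how this artifact is meant to be reused (a toy ColumnTransformer and a temporary file stand in for the real pipeline and path), a transformer reloaded with joblib.load produces exactly the same output as the original object:

```python
## Persist a fitted ColumnTransformer and reload it elsewhere
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the notebook's preprocessor
demo = pd.DataFrame({'amount': [1000.0, 5000.0, 250.0],
                     'grade': ['A', 'B', 'A']})
ct = ColumnTransformer([('num', StandardScaler(), ['amount']),
                        ('cat', OneHotEncoder(), ['grade'])])
ct.fit(demo)

# Dump to disk, then reload as the web app would
path = os.path.join(tempfile.mkdtemp(), 'preproc.pkl')
joblib.dump(ct, path)
ct_loaded = joblib.load(path)
```

This is why saving the *fitted* preprocessor matters: the Streamlit app can then transform a single applicant exactly as the training data was transformed.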

4.4 Over-sampling: ¶

dealing with data imbalance¶

In [41]:
smote = SMOTE(random_state=42)
X_train_balanced, y_train_balanced = smote.fit_resample(X_train_preprocessed, y_train)
In [42]:
# Replace numeric class labels with words
class_labels_mapping = {0: 'paid', 1: 'default'}
y_train_mapped = y_train.map(class_labels_mapping)
y_train_balanced_mapped = pd.Series(y_train_balanced).map(class_labels_mapping)

# Create bar plot for class distribution before SMOTE with words
plt.figure(figsize=(6, 4))
y_train_mapped.value_counts().plot(kind='bar')
plt.xlabel('loan_status')
plt.ylabel('Count')
plt.title('Class Distribution Before SMOTE')
plt.xticks(rotation=0)
plt.show()

# Create bar plot for class distribution after SMOTE with words
plt.figure(figsize=(6, 4))
y_train_balanced_mapped.value_counts().plot(kind='bar')
plt.xlabel('loan_status')
plt.ylabel('Count')
plt.title('Class Distribution After SMOTE')
plt.xticks(rotation=0)
plt.show()
In [43]:
X_test_processed = preprocessor.transform(X_test)  # transform only: the pipeline must stay fitted on the train set to avoid leakage

5. Training models: ¶

5.1 hyperparameter tuning / performing Grid Search with cross-validation on each model ¶

  • performing Grid Search with cross-validation on each model using the specified hyperparameter grid
In [44]:
# Define the models and their respective hyperparameter grids
models = {
    'XGBoost': (XGBClassifier(), {'n_estimators': [100, 200, 300], 'max_depth': [6,8,10], 'learning_rate': [0.01, 0.05, 0.1]}),  # listed explicitly: range(4) would include the invalid n_estimators=0
    'Logistic Regression': (LogisticRegression(), {'C': [0.01, 0.1, 1, 10]}),
    #'SVM': (SVC(), {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}),
    'Neural Network': (MLPClassifier(), {'hidden_layer_sizes': [(100,), (100, 50)], 'activation': ['relu', 'tanh']}),
    'Random Forest': (RandomForestClassifier(random_state=0, class_weight='balanced'), {'n_estimators': [100, 200, 300], 'max_depth': [None, 5, 10]}),
}

# Define a dictionary to store the evaluation metrics for each model
evaluation_metrics = {
    'Model': [],
    'Cross-Val Score': [],
    'Accuracy': [],
    'F1 Score': [],
    'MSRE': []
}

# Create a dictionary to store the best models
best_models = {}

# Perform cross-validation and hyperparameter tuning for each model
for model_name, (model, param_grid) in models.items():
    grid_search = GridSearchCV(model, param_grid, cv=5)
    grid_search.fit(X_train_balanced, y_train_balanced)

    print(f"Model: {model_name}")
    print(f"Best parameters: {grid_search.best_params_}")
    print(f"Best cross-validation score: {grid_search.best_score_:.3f}\n")
    
    # Append the evaluation metrics to the dictionary
    evaluation_metrics['Model'].append(model_name)
    evaluation_metrics['Cross-Val Score'].append(grid_search.best_score_)
    
    # Get the best model
    best_model = grid_search.best_estimator_
    
    # Store the best model in the dictionary
    best_models[model_name] = best_model

    # Predict the test set using the best model
    y_pred = best_model.predict(X_test_processed)

    # Calculate accuracy
    accuracy = accuracy_score(y_test, y_pred)
    evaluation_metrics['Accuracy'].append(accuracy)

    # Calculate F1 score
    f1 = f1_score(y_test, y_pred, average='weighted')
    evaluation_metrics['F1 Score'].append(f1)

    # Calculate the mean squared error (stored under the 'MSRE' label)
    msre = mean_squared_error(y_test, y_pred)
    evaluation_metrics['MSRE'].append(msre)

# Convert the dictionary to a Pandas DataFrame for easy plotting
metrics_df = pd.DataFrame(evaluation_metrics)

# Plot the evaluation metrics
plt.figure(figsize=(10, 6))
plt.bar(metrics_df['Model'], metrics_df['Cross-Val Score'], label='Cross-Val Score', alpha=0.7)
plt.bar(metrics_df['Model'], metrics_df['Accuracy'], label='Accuracy', alpha=0.7)
plt.bar(metrics_df['Model'], metrics_df['F1 Score'], label='F1 Score', alpha=0.7)
plt.bar(metrics_df['Model'], metrics_df['MSRE'], label='MSRE', alpha=0.7)
plt.xticks(rotation=45)
plt.xlabel('Model')
plt.ylabel('Score')
plt.title('Model Evaluation Metrics')
plt.legend()
plt.tight_layout()
plt.show()

# After the loop, training is complete
print("Training completed!")
Model: XGBoost
Best parameters: {'learning_rate': 0.1, 'max_depth': 8, 'n_estimators': 300}
Best cross-validation score: 0.955

Model: Logistic Regression
Best parameters: {'C': 10}
Best cross-validation score: 0.801

Model: Neural Network
Best parameters: {'activation': 'tanh', 'hidden_layer_sizes': (100, 50)}
Best cross-validation score: 0.908

Model: Random Forest
Best parameters: {'max_depth': None, 'n_estimators': 300}
Best cross-validation score: 0.948

Training completed!

6. Evaluation of the results ¶

In [45]:
evaluation_metrics = pd.DataFrame(evaluation_metrics)

6.1 scores of the models with different metrics ('Accuracy', 'F1 Score', 'MSRE',...)¶

In [46]:
metrics_to_plot = ['Cross-Val Score', 'Accuracy', 'F1 Score', 'MSRE']

for metric in metrics_to_plot:
    plt.figure(figsize=(8, 6))
    plt.bar(evaluation_metrics['Model'], evaluation_metrics[metric], alpha=0.7)
    plt.xticks(rotation=45)
    plt.xlabel('Model')
    plt.ylabel('Score')
    plt.title(f'Model Evaluation Metric: {metric}')
    plt.tight_layout()
    plt.show()

6.2 Learning curve of the most performant models¶

In [47]:
models = {
    'Logistic Regression': LogisticRegression(),
    'Random Forest': RandomForestClassifier(random_state=0, class_weight='balanced'),
    'Neural Network': MLPClassifier()
}

# Create a function to plot the learning curve
def plot_learning_curve(model, X, y):
    train_sizes, train_scores, test_scores = learning_curve(model, X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring='accuracy')
    
    train_mean = np.mean(train_scores, axis=1)
    train_std = np.std(train_scores, axis=1)
    test_mean = np.mean(test_scores, axis=1)
    test_std = np.std(test_scores, axis=1)
    
    plt.figure(figsize=(8, 6))
    plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.1, color='blue')
    plt.fill_between(train_sizes, test_mean - test_std, test_mean + test_std, alpha=0.1, color='orange')
    plt.plot(train_sizes, train_mean, 'o-', color='blue', label='Training Score')
    plt.plot(train_sizes, test_mean, 'o-', color='orange', label='Cross-Validation Score')
    plt.xlabel('Training Examples')
    plt.ylabel('Score')
    plt.title(f'Learning Curve for {model.__class__.__name__}')
    plt.legend()
    plt.grid(True)
    plt.show()

# Loop over the models and plot the learning curve for each
for model_name, model in models.items():
    plot_learning_curve(model, X_train_balanced, y_train_balanced)

6.3 confusion matrix ¶

In [48]:
# Create a dictionary to store confusion matrices for each model
conf_matrices = {}

# Loop over the models and calculate the confusion matrix for each
for model_name, model in best_models.items():
    y_pred = model.predict(X_test_processed)
    conf_matrix = confusion_matrix(y_test, y_pred)
    conf_matrices[model_name] = conf_matrix

# Plot the confusion matrices
plt.figure(figsize=(12, 8))
for i, (model_name, conf_matrix) in enumerate(conf_matrices.items()):
    plt.subplot(2, 2, i + 1)
    sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', cbar=False, square=True)
    plt.xlabel('Predicted Label')
    plt.ylabel('True Label')
    plt.title(f'Confusion Matrix - {model_name}')
plt.tight_layout()
plt.show()

7. Pickling the best Model: ¶

In [49]:
# Save the best model by test-set F1 score
# (the bare `model` variable would be the last, unfitted estimator left over from the learning-curve loop)
best_name = metrics_df.loc[metrics_df['F1 Score'].idxmax(), 'Model']
joblib.dump(best_models[best_name], 'best_model.pkl')
Out[49]:
['best_model.pkl']

8. WebApp implementation with Streamlit: ¶
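The app itself lives in a separate script, so here is only a hedged sketch of its core (predict_risk and the toy artifacts below are illustrative, not the notebook's objects): the app loads the pickled preprocessor and model with joblib.load, turns the form inputs collected via st.number_input / st.selectbox into a one-row DataFrame, and maps the prediction back to a label.

```python
## Hypothetical core of the Streamlit app (names are illustrative)
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def predict_risk(preprocessor, model, applicant: dict) -> str:
    """Score one applicant using the fitted preprocessor and model."""
    row = pd.DataFrame([applicant])          # one-row frame, same columns as training
    features = preprocessor.transform(row)   # reuse the fitted pipeline, never refit
    return 'default' if model.predict(features)[0] == 1 else 'paid'

# Toy stand-in artifacts; in the app these would come from joblib.load(...)
train = pd.DataFrame({'loan_amnt': [1000.0, 30000.0, 2000.0, 25000.0],
                      'loan_grade': ['A', 'G', 'A', 'F']})
labels = [0, 1, 0, 1]
prep = ColumnTransformer([('num', StandardScaler(), ['loan_amnt']),
                          ('cat', OneHotEncoder(handle_unknown='ignore'), ['loan_grade'])])
clf = LogisticRegression().fit(prep.fit_transform(train), labels)

print(predict_risk(prep, clf, {'loan_amnt': 28000.0, 'loan_grade': 'G'}))
```

In the real app, a call like st.write(predict_risk(...)) would display this label after the user submits the form.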

9. Conclusion: ¶

In this project I was exposed to many concepts, such as:

--> building a pipeline

--> hyperparameter tuning

--> evaluating models

--> building my first Streamlit application

--> deploying it.

In conclusion, this credit risk analysis project demonstrates the power of data science and machine learning in the financial industry. This web app can serve as a valuable tool for financial institutions to assess credit risk, make informed lending decisions, and mitigate potential losses.

However, as with any data science project, there are a few points to keep in mind:

Continuous monitoring: Credit risk is a dynamic domain, and models need regular updates to adapt to changing economic conditions and borrower behaviors.

Model Robustness: Although the achieved accuracy is excellent, it's essential to test the model's robustness on a wider range of scenarios and data distributions.

Ethical Considerations: Credit risk models must be fair and unbiased. Continuously monitor for any potential bias and ensure fairness in lending decisions.

Model Deployment: Deploying a machine learning model in production involves careful considerations, such as scalability, security, and version control.

In [ ]: